========================
Adding and Managing Data
========================

----------------------
Creating a New Project
----------------------

1. Go to the :ref:`Projects` view and press **Create**.

   .. image:: /_static/new-project2.png
      :scale: 80%

2. Enter the title of the new project, select its users and administrators, then press **Create Project**.

3. Your new project should be visible in the projects list.

   .. image:: /_static/new-project3.png
      :scale: 60%

4. The new project should be selected automatically. If you have the correct rights, you can now add indices to the project (see more information in :ref:`Indices`).

|br|

------

-------------------------------------
Adding Data with the Dataset Importer
-------------------------------------

1. Open the **Tools** menu and pick **Dataset Importer**. Press **Create** to create a new dataset importer task.

   .. image:: /_static/dataset-importer1.png
      :scale: 60%

2. Enter a task description and a dataset name (this becomes the name of the created index). The index name must be unique; otherwise the import might fail or add the data to an existing index.

3. Choose a file to upload (in jsonlines, CSV or Excel format). The jsonlines format is preferred for importing datasets. Press **Create**.

   .. image:: /_static/dataset-importer2.png
      :scale: 60%

   In this example we use a jsonlines document that was exported previously in :ref:`Downloading Data`.

4. Refresh the page to see the task's progress. When the task is completed, the new index is added to the project automatically.

.. note::
   Check in the task details that all your data was imported (compare *Documents* with *Documents success*). Access the details by clicking on the task.

.. image:: /_static/dataset-importer-success.png

*All is correct*

|br|

------

---------------
Reindexing Data
---------------

**Reindexing allows you to:**

* rename an existing index
* merge different indices together
* get a fixed number of documents from an index
* use a query to get a subset of an existing index

If you plan to process the data or use any tools on it, it is a good idea to work on a copy or a subset of the index.

1. Click on the **Tools** menu and choose the **Reindexer**. Press **Create** to make a new task.

   .. image:: /_static/reindexer1.png
      :scale: 60%

2. Enter a description of the reindexing task and a new index name. The new index name must be unique; otherwise the reindexing task might fail or add the data to an existing index.

3. Choose the index or indices you wish to reindex. If you choose two or more indices, the result will be a merged index.

   .. image:: /_static/reindexer_fields.png
      :scale: 60%

4. Choose the fields you wish to include. By default all fields are selected; deselect any field you do not want in the new index.

   .. image:: /_static/reindexer_query.png

5. You can use a query to include only the documents that match it. This is one way to create a subset of documents from other indices.

   .. image:: /_static/reindexer_subset.png

6. You can also use the random subset size to get a fixed number of documents from the other index or indices. This can be combined with the query option.

   .. image:: /_static/reindexer_facts_map.png

7. Don’t forget to add the facts mapping to the new index; if you are reindexing an index that contains facts, this option should be turned on by default. The facts mapping is needed for facts to work correctly.

8. Press **Create** to start the reindexing task. Refresh the page to see the task's progress. The new index should be added to the project automatically when it is created.
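
The effect of the Reindexer options (merging, field selection, query and random subset size) can be pictured with a few lines of plain Python. This is only a conceptual sketch and not how the Reindexer is implemented: the function ``reindex_sketch``, the sample documents and the use of a Python predicate in place of a real query are all hypothetical.

```python
import random


def reindex_sketch(indices, fields=None, query=None, random_size=None, seed=0):
    """Hypothetical illustration of the Reindexer options on lists of dicts."""
    # Choosing several source indices yields one merged set of documents.
    docs = [doc for index in indices for doc in index]
    # A query keeps only the matching documents (a plain predicate here).
    if query is not None:
        docs = [doc for doc in docs if query(doc)]
    # The random subset size draws a fixed number of documents at random.
    if random_size is not None:
        rng = random.Random(seed)
        docs = rng.sample(docs, min(random_size, len(docs)))
    # Deselected fields are dropped from every document.
    if fields is not None:
        docs = [{k: v for k, v in doc.items() if k in fields} for doc in docs]
    return docs


# Two made-up source "indices" with five documents each.
index_a = [{"text": f"a{i}", "lang": "en", "year": 2000 + i} for i in range(5)]
index_b = [{"text": f"b{i}", "lang": "et", "year": 2010 + i} for i in range(5)]

subset = reindex_sketch(
    [index_a, index_b],
    fields={"text", "lang"},
    query=lambda doc: doc["year"] >= 2003,
    random_size=3,
)
print(len(subset), sorted(subset[0]))
```

Here seven documents match the query, three of them are drawn at random, and only the ``text`` and ``lang`` fields survive, which mirrors combining the query with the random subset size.
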

|br|

------

--------------------------
Add New Indices to Project
--------------------------

You've created a new index, but want to move it to a different project. Here's how to add that index to your project:

1. If you have the correct rights, you can add indices to the project: click the pencil button on the blue menu ribbon to go into **Edit project**.

   .. image:: /_static/edit-project.png

2. Find the index or indices you want to add to the project. You can type to find the index more quickly.

   .. image:: /_static/edit-project3.png
      :scale: 80%

3. Tick the box in front of the index name, then click **Save changes**.

   .. image:: /_static/edit-project2.png
      :scale: 80%

After this, :ref:`pick the index ` you just added to look at it in Searcher.

|br|

------

------------------------------
Removing an Index from Project
------------------------------

Removing an index from a project is very similar to adding new indices to a project.

.. image:: /_static/edit-project.png

1. If you have the correct rights, you can remove any unneeded indices: click the pencil button on the blue menu ribbon to go into **Edit project**.

   .. image:: /_static/edit-project2.png
      :scale: 80%

2. Find the index or indices you want to remove from the project. You can type to find the index more quickly.

   .. image:: /_static/edit-project3.png
      :scale: 80%

3. Remove the tick from the box in front of the index name.

   .. image:: /_static/edit-project4.png
      :scale: 80%

4. Then click **Save changes**.

|br|

------

------------------
Splitting an Index
------------------

The Index Splitter creates two indices from one index, for example to divide your data into two sub-corpora or to create a train and a test set for a model.

1. To split an index, open the **Tools** menu, click on **Index Splitter**, then press **Create**.

   .. image:: /_static/index-splitter1.png
      :scale: 60%

2. Enter a description and select the index or indices you want to split.

3. Choose the fields you want to keep; by default all fields are included.

4. You can use a query here to get a subset of the data.

   .. image:: /_static/index-splitter2.png
      :scale: 60%

5. Pick the names of your two new indices. In this menu they are referred to as train and test, but they can be anything you wish. However, the names of the new indices must be unique; otherwise the task will fail or the documents might be added to an existing index.

6. After naming the new indices, use **Test size** to choose the percentage or number of documents that will go into each index.

7. You can adjust the **Distribution** based on the proportions of a fact you are interested in. By default the distribution is random (each document is put into one of the two indices at random). The other options are to keep the distribution similar to the original one by using a fact name or fact value, or to make the distribution equal (make the test index 50/50 with regard to this fact).

For example, suppose an index contains 100 documents, of which 20% are social sciences and 80% are humanities documents. There are several ways to split these documents into two indices:

+-----------------------------+----------+--------------+------------+-------------------------+-------------------------------------------------------+
| Purpose                     | Test size| Distribution | Fact name  | Query                   | Result                                                |
+=============================+==========+==============+============+=========================+=======================================================+
| We want only social sciences| 50%      | random       | None       | get only social sciences| each index contains just 10 social sciences documents |
+-----------------------------+----------+--------------+------------+-------------------------+-------------------------------------------------------+
| We want to split in half    | 50%      | random       | None       | None                    | each index has 50 documents, but all the social       |
| regardless of distribution  |          |              |            |                         | sciences documents could be in just one index         |
+-----------------------------+----------+--------------+------------+-------------------------+-------------------------------------------------------+
| Split in half, retaining    | 50%      | original     | fact name: | None                    | each index has 10 social sciences and 40 humanities   |
| original distribution       |          |              | Discipline |                         | documents                                             |
+-----------------------------+----------+--------------+------------+-------------------------+-------------------------------------------------------+
| Make the distribution in    | 10       | equal        | fact name: | None                    | one index has 10 social sciences and 10 humanities    |
| one index equal             |          |              | Discipline |                         | documents, the other has the rest of the documents    |
+-----------------------------+----------+--------------+------------+-------------------------+-------------------------------------------------------+
| Make both indices           | Use custom distribution, **look at the example below**.                                                                |
| distribution equal          |                                                                                                                        |
+-----------------------------+------------------------------------------------------------------------------------------------------------------------+

.. figure:: /_static/index-splitter3.png
   :scale: 60%

   *An example of using equal distribution.*

.. figure:: /_static/index-splitter4.png

   *Result in the split index ba_ma_soc_hum_2.*

.. figure:: /_static/index-splitter5.png
   :scale: 60%

   *An example of using custom distribution.*

.. figure:: /_static/index-splitter6.png

   *Result in the split index ba_ma_custom2.*

.. |br| raw:: html
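
The arithmetic behind the *original* distribution option of the Index Splitter can be checked with a short Python sketch. It assumes that the split is done separately for each fact value so that every value keeps its share; how the Index Splitter does this internally is not described here, and the function ``split_original`` and the generated documents are hypothetical.

```python
import random
from collections import defaultdict


def split_original(docs, fact, test_size, seed=0):
    """Hypothetical sketch: split docs into (train, test) per fact value."""
    rng = random.Random(seed)
    # Group the documents by the value of the chosen fact.
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[fact]].append(doc)
    train, test = [], []
    for group in groups.values():
        rng.shuffle(group)
        # Each group contributes the same fraction to the test index.
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# 100 made-up documents: 20 social sciences and 80 humanities.
docs = [{"id": i, "Discipline": "social sciences"} for i in range(20)]
docs += [{"id": i, "Discipline": "humanities"} for i in range(20, 100)]

train, test = split_original(docs, fact="Discipline", test_size=0.5)
soc_in_test = sum(d["Discipline"] == "social sciences" for d in test)
hum_in_test = sum(d["Discipline"] == "humanities" for d in test)
print(soc_in_test, hum_in_test)
```

With a test size of 50%, each of the two resulting lists holds 10 social sciences and 40 humanities documents, which agrees with the "Split in half, retaining original distribution" case described in this section.
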